Abstract: Adding similar features and bug fixes often requires
porting program patches from reference implementations and 
adapting them to target implementations. Porting errors may 
result from faulty adaptations or inconsistent updates. This paper 
investigates (1) the types of porting errors found in practice, 
and (2) how to detect and characterize potential porting errors. 
Analyzing version histories, we define five categories of porting 
errors, including incorrect control- and data-flow, code redundancy,
inconsistent identifier renamings, etc. Leveraging this
categorization, we design a static control- and data-dependence
analysis technique, SPA, to detect and characterize porting
inconsistencies. Our evaluation on code from four open-source
projects shows that SPA can detect porting inconsistencies with
65% to 73% precision and 90% recall, and identify inconsistency
types with 58% to 63% precision and 92% to 100% recall. In a
comparison with two existing error detection tools, SPA improves 
precision by 14 to 17 percentage points. 

I. Introduction

Developers often port code from one implementation to 
another in order to implement similar features or bug fixes.
A recent case study of OpenBSD, NetBSD, and FreeBSD
found that 11% to 16% of code changes are ported from peer
projects [18]. Also, when libraries and frameworks evolve their
APIs, client applications make similar updates to use the new
APIs correctly [3]. In a large code base, typically 10% to
30% of the code is considered to be code clones [11], which
often require similar updates during software evolution [13]. 
When porting changes from one implementation to another, 
developers generally need to adapt the ported changes to fit 
the new context. The code in the reference often serves as 
a template that is pasted into the target implementation, and 
then later adapted [12].

The process of adapting a change to fit another context 
can be error-prone, often resulting in porting errors. Chou 
et al. report that a significant portion of operating system 
bugs comes from ported edits [4]. In a case study of clone 
related bugs, Juergens et al. discover that "nearly every
second, unintentional inconsistent changes to clones lead to
a fault" [10]. Li et al. identify errors in Linux and FreeBSD
resulting from developers forgetting to rename identifiers after
porting code [15]. Jiang et al. [9] present evidence of porting
errors when similar code appears in different contexts. Porting 
errors can also happen when developers evolve ported code 
differently [6], [10]. 

When developers port code from a reference to a target con- 
text, they usually expect the ported code to behave similarly. 


Existing tool support for detecting semantic inconsistencies in
ported code is limited. For example, Li et al. and Juergens
et al. find inconsistent clones using a lexical clone detection
analysis [10], [15]. Jiang et al. and Gabel et al. report clone
related bugs by comparing the syntax tree structures for two 
clones [6], [9]. Such syntactic and lexical analyses are not 
sufficient to detect the semantic inconsistencies arising from 
updates to the ported code in different contexts. 

The goal of this work is to assist developers in porting 
edits from one context to another, by detecting semantic
inconsistencies that may indicate a porting error. As a first
step towards this goal, we study the extent and characteristics
of porting errors that occur in practice to better understand 
the types of porting errors and their fixes. In our study, 
we work backwards by first mining the version histories of 
Linux and FreeBSD to detect commit messages containing 
porting error related keywords. We then analyze three types
of source code commits (fix-inducing, error-inducing, and
reference) and their corresponding patches. A patch is the set
of program statements that are added, deleted, or modified in
a program version with respect to its previous version. Note
that modified statements can also be represented as deleted
statements in the old version and added statements in the
new version. We use Sliwerski et al.'s fix-inducing change
identification method [21] to identify the patch that originally
introduced the porting error. We then use Repertoire [18] to
find a reference patch that served as the template for the error-
inducing patch. Through manual investigation of the reference
patch, the error-inducing patch, and the fix patch, we find that 
many of the porting errors result from incorrect adaptation of 
the ported code, including inconsistent identifier renamings, 
different control- and data-flow contexts in the reference and 
target implementations, and code redundancy. 

Leveraging this characterization of porting errors, we design 
and implement SPA, an algorithm to detect and characterize
porting inconsistencies. SPA detects semantic inconsistencies
that arise due to the interactions between program statements
in the ported code and program statements surrounding the
ported code. SPA takes two code patches as input: a reference
patch (Ref_old and Ref_new) and a target patch (Tar_old and
Tar_new). SPA analyzes the reference and target patches to
identify the ported code, and then uses static control- and data-
dependence analyses to identify the impact of the ported code
on the reference and target contexts. Finally, SPA compares the
impact of the ported code on the reference and target semantics
to detect and characterize porting inconsistencies.

To evaluate the accuracy of SPA, we perform an empirical
evaluation on four large open-source codebases: FreeBSD,
Linux, Eclipse CDT, and Mozilla, and compare the results with
two state-of-the-art tools, DejaVu [6] and Jiang et al.'s clone
related bug detection tool [9]. The results of our study show
that SPA identifies semantic porting inconsistencies with 65% 
to 73% precision and 90% recall and identifies inconsistency 
types with 58% to 63% precision and 92% to 100% recall. SPA 
outperforms two related error detection tools with a precision 
improvement of 14 to 17 percentage points. 

We make the following contributions:

• We conduct a comprehensive study of the extent and 
characteristics of porting errors reported for real-world 
systems. We identify categories of common porting errors 
related to inconsistent control flow, inconsistent data flow, 
inconsistent identifier renaming, and code redundancy. 

• Leveraging information about commonly found porting 
errors, we design and implement a novel algorithm, SPA, 
to detect potential porting errors based on inconsistent 
semantics of ported code between the reference and target 
contexts. 

• We conduct an empirical evaluation of SPA's ability to
detect and characterize porting inconsistencies in four 
large open-source codebases. 

The rest of the paper is organized as follows. Section II 
discusses an empirical study of porting errors in Linux and 
FreeBSD. Section III discusses SPA's methodology for detecting
and characterizing porting inconsistencies. Section IV
presents an empirical evaluation of SPA's capability to detect
and characterize porting inconsistencies. Section V discusses 
related work. Finally, Section VI summarizes our work and 
directions for future work. 

II. An Empirical Study of Porting Errors 

We conduct an empirical study of porting errors documented 
in real world projects to better understand the extent and 
characteristics of porting errors found in practice. In this study, 
we focus on porting errors that arise when porting a patch to 
a similar, but not identical, context within the same project. 
We first identify porting errors that are reported and fixed by 
developers using the version histories from two large, open-source
projects. We then manually analyze these errors to
understand the characteristics of the errors as well as the
fixes. Most of the errors found in the artifacts used in our
study can largely be characterized into five categories. In the
remainder of this section, we present the study setup, results,
and a description of the five categories of porting errors. We
first define several key terms used in this work.

Definition 2.1: A program patch, p := Δ(v_i, v_j), is the
set of syntactic program differences between two program
versions, v_i and v_j, where each element in the set is an atomic
program statement that corresponds to an edit operation, e.g.,
insert, delete, move, and update.

Definition 2.2: Ported code is a pair of atomic program
statements s_r and s_t in patches p_r and p_t respectively, such
that s_r and s_t are syntactically similar and are also edited
similarly.

Definition 2.3: Context of ported code is the set of program
statements in a method that are not part of the ported code.

A. Study Method 

We mine the commit logs and analyze version histories for 
Linux and FreeBSD. Table I shows the size of the two projects
in KLOC, the evolution period under study, and the number 
of unique developers who made commits during that period. 

Developers often document fixes to porting errors in commit
messages. To detect how many bug fixes are related to
porting, we find commit logs that contain at least one porting
related keyword: copy, cut, paste, or porting, and at
least one error related keyword: error, bug, mistake,
fix, or defect. A sample commit message in FreeBSD
is "Fix cut&paste bug which would result in
a panic". The corresponding code patch fixes the porting
error.
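
The keyword filter described above can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the function names are hypothetical, and plain case-insensitive substring matching is an assumption (the text does not say how keywords were matched, and substring matching can over-match, e.g. "cut" inside "execute").

```python
# Keyword lists taken from the study description.
PORTING_KEYWORDS = ("copy", "cut", "paste", "porting")
ERROR_KEYWORDS = ("error", "bug", "mistake", "fix", "defect")


def mentions_any(message, keywords):
    """Return True if any keyword occurs in the message (case-insensitive substring)."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in keywords)


def is_porting_error_fix(message):
    """A commit qualifies when it has at least one keyword from each list."""
    return (mentions_any(message, PORTING_KEYWORDS)
            and mentions_any(message, ERROR_KEYWORDS))


# Hypothetical commit log; only the first message comes from the text.
commit_log = [
    "Fix cut&paste bug which would result in a panic",
    "Add support for a new network driver",
    "Fix typo in comment",
]
candidates = [m for m in commit_log if is_porting_error_fix(m)]
```

Commits selected this way are only candidates; the study still inspects the associated patches manually.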

To understand the nature of porting errors, we work backwards
from a porting error fix by extracting three patches:
(a) the fix patch, p_f, where the porting error is fixed, (b) the
target patch, p_t, where the porting error is introduced into the
codebase, and (c) the reference patch, p_r, which contains edits
that serve as the template for the ported code. A fix patch p_f is
the program patch associated with the mined commit message.
For example, the fix patch corresponding to the commit
message shown above is represented by the colored lines in
the IR-1 example in Table II. From the program locations
edited in p_f, we use cvs annotate or git blame to
identify the target patch, which introduced the porting error.
This process is similar to how Sliwerski et al. [21] identify
a fix-inducing patch. We then use the Repertoire tool to
identify a set of candidate reference patches that may serve as
the template for the target patch p_t [19]. The reference patch,
by definition, has a commit date prior to the revision date of 
a target patch; hence, we consider patches available until the 
target patch date as candidate reference patches. Finally, we 
select the reference patch, p_r, through a manual inspection of
the possible candidates. For example, in the IR-1 example in
Table II where the developer forgot to update an identifier bp
to rabp after porting code fragments from the reference patch,
we expect the reference patch to contain the unaltered code
fragment related to bp. When multiple patches contain similar 
unaltered code fragments, we select a patch with the maximum 
number of similar lines. 
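
The final selection rule (earlier commit date, then maximum number of similar lines) can be sketched as below. This is only an illustration of the tie-breaking step: the Patch class, the exact-match notion of "similar line", and the sample dates are assumptions, and the actual study combines Repertoire's output with manual inspection.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Patch:
    name: str
    commit_date: date
    lines: List[str]


def similar_line_count(candidate: Patch, target: Patch) -> int:
    """Count target lines that also occur in the candidate (exact match after stripping)."""
    candidate_lines = {line.strip() for line in candidate.lines}
    return sum(1 for line in target.lines if line.strip() in candidate_lines)


def pick_reference(candidates: List[Patch], target: Patch) -> Optional[Patch]:
    """Keep candidates dated no later than the target, then take the most similar one."""
    eligible = [c for c in candidates if c.commit_date <= target.commit_date]
    if not eligible:
        return None
    return max(eligible, key=lambda c: similar_line_count(c, target))


# Hypothetical data loosely modeled on the IR-1 example (stale bp in the target).
target = Patch("target", date(2003, 1, 1), [
    "rabp->b_flags &= ~B_INVAL;",
    "VOP_SPECSTRATEGY(vp, bp);",
    "VOP_STRATEGY(vp, bp);",
])
patch_a = Patch("A", date(2002, 6, 1), ["VOP_SPECSTRATEGY(vp, bp);", "VOP_STRATEGY(vp, bp);"])
patch_b = Patch("B", date(2001, 3, 1), ["VOP_STRATEGY(vp, bp);"])
patch_c = Patch("C", date(2004, 2, 1), ["VOP_SPECSTRATEGY(vp, bp);", "VOP_STRATEGY(vp, bp);"])

reference = pick_reference([patch_a, patch_b, patch_c], target)
```

Here patch C is as similar as patch A but is dated after the target, so it cannot have served as the template.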

B. Porting Errors Characterization 

In our study we were able to identify 113 and 182 porting
errors documented in FreeBSD and Linux version histories
over the course of 18 years and 3 years respectively. Based
on the porting errors analyzed in our study, we were able to
classify the errors into five different categories. We use the
code snippets in Table II to discuss each of the categories of
porting errors below.

TABLE I
Study Subjects

          KLOC     Developers   Years
Linux     14,998   6,839        3
FreeBSD   4,479    405          18

ICF: Inconsistent Control Flow. Many porting errors arise 
from edits that are ported to a different control flow context
and are not adapted correctly with respect to the context. In
the ICF example shown in Table II, there is an extra for
loop, highlighted in gray, in the reference context. Thus,
the continue statement in the reference code is intended to
match the inner for loop. In the target context, however,
there is only one for loop. Thus, the continue statement
(marked in red) unintentionally matches the wrong for loop.
The corresponding fix patch removes the continue statement
in the target context to fix the error.

IR: Inconsistent Renaming. Developers often forget to adapt 
variable, type, and constant names according to the target 
context and these inconsistent renamings lead to porting errors. 
This type of porting error is further split into two sub-categories:

IR-1: Inconsistent renamings of identifiers. Developers rename
some occurrences of an identifier i but forget to update
all occurrences of the identifier i consistently. For example,
pointer bp is updated to pointer rabp three times, missing
the instances marked in red in the IR-1 example in Table II.

IR-2: Inconsistent renamings of related identifiers. Developers
consistently rename an identifier, but forget to update all
related identifiers. In the IR-2 example in Table II, all instances
of the OFDM related macro IWL_FIRST_OFDM_RATE are
updated to the CCK related macro IWL_FIRST_CCK_RATE.
However, the variable ofdm and the related macro
lowest_present_ofdm are not updated to cck and the
related macro lowest_present_cck. The corresponding
fix patch replaces the token ofdm with the token cck to fix
this error.

IDF: Inconsistent Data Flow. This inconsistency occurs when
developers mistakenly insert code into a different data initialization
context. In the IDF example in Table II, the first argument
of the strcmp call, optarg, is initialized differently in the
reference and target edits. optarg is a global variable
set by the getopt() call, which parses the command
line arguments and stores the next option argument in optarg.
Hence, the function call getopt() and the use of variable
optarg should occur as a pair. In the reference context,
optarg is used after getopt() and thus is initialized
properly. In the target context, however, there is no call to
getopt(). Thus, optarg is not initialized properly.

RDN: Redundant operations. Developers may inadvertently
introduce redundant operations when they port code to the
wrong place, e.g., where it already performs the same operation,
or they may not update ported edits correctly to ensure
there are no redundant computations in the target context.
In the RDN example in Table II, a code fragment related
to memcpy was ported to the same function body twice under
the same scope in FreeBSD. The corresponding patch removes
memcpy and the buffer initialization statements to correct
the redundant operations.

Fig. 1. Relationship between different types of porting errors

OTH: Others. Other porting errors we identified include
incorrect formatting, e.g., indentation, that does not match
the rest of the target code structure, or unadapted comments
that do not describe the target code correctly. For example,
in FreeBSD file src/sys/geom/stripe/g_stripe.h,
version 13, a comment related to "Concat Name" was
not updated to "Stripe Name".

C. Distribution of Porting Errors in FreeBSD and Linux 
TABLE III
Distribution of porting errors

          ICF      IR       IDF      RDN      OTH      Total
Linux     23       74       26       47       25       182
          12.64%   40.66%   14.29%   25.82%   13.74%
FreeBSD   9        54       32       14       28       113
          7.96%    47.78%   28.31%   12.39%   24.78%


By manually inspecting the sets of reference patch, p_r,
target patch, p_t, porting error fix patch, p_f, associated commit
messages, and bug descriptions, we categorize the porting
errors into the five categories described above. Table III shows
a distribution of the 113 cases of FreeBSD and 182 cases of
Linux across the five categories of porting errors. The results
show that the largest share of porting errors is due to inconsistent
renaming of identifiers (IR): 47.78% and 40.66% in FreeBSD
and Linux respectively. The errors related to control (ICF) and
data (IDF) flow inconsistency together make up more than 25%
of the total porting errors in each project. The rest of the errors
are either due to redundant operations (RDN), 12.39% and 25.82%,
or wrong indentation and comments (OTH), 24.78% and 13.74%, in
FreeBSD and Linux respectively.
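
As a quick arithmetic check, the percentages in Table III can be recomputed from the raw counts. The sketch below uses only the counts reported in the table; the variable and function names are illustrative. Note that the per-category counts sum to more than the project totals, which is expected because one error can fall into several categories.

```python
# Counts and totals as reported in Table III.
TOTALS = {"Linux": 182, "FreeBSD": 113}
COUNTS = {
    "Linux": {"ICF": 23, "IR": 74, "IDF": 26, "RDN": 47, "OTH": 25},
    "FreeBSD": {"ICF": 9, "IR": 54, "IDF": 32, "RDN": 14, "OTH": 28},
}


def share(project, category):
    """Percentage of a project's porting errors that fall into a category."""
    return round(100.0 * COUNTS[project][category] / TOTALS[project], 2)


for project in TOTALS:
    flow_share = share(project, "ICF") + share(project, "IDF")
    overlap = sum(COUNTS[project].values()) - TOTALS[project]
    print(project, share(project, "IR"), round(flow_share, 2), overlap)
```

Rounding to two decimals reproduces the table to within 0.01 percentage points; the small differences suggest that some table entries were truncated rather than rounded.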

The error categories are not mutually exclusive. For example,
an inconsistent renaming error (IR) may also cause an
inconsistent data initialization error (IDF): 17.7% and 1.6%
of the porting errors in FreeBSD and Linux respectively are
both types IR and IDF. An inconsistent data initialization error
(IDF) may also generate redundant operations (RDN): 1.8%
in FreeBSD and 2.7% in Linux. Sometimes, an inconsistent
control flow (ICF) may also initialize the data erroneously


TABLE II
Examples of porting errors of different types

ICF: Inconsistent Control Flow
FreeBSD commit: src/sys/kern/sched_4bsd.c, version 1.90, Author: davidxu, Date: 2006/11
Log: Fix a copy-paste bug in NON-KSE case.

Reference File: src/sys/kern/sched_4bsd.c
    FOREACH_KSEGRP_IN_PROC(p, kg) {
        awake = 0;
        FOREACH_THREAD_IN_GROUP(kg, td) {
    +       if (ke->ke_cpticks == 0)
    +           continue;
            if (FSHIFT >= CCPU_SHIFT)
                ke->ke_pctcpu += (realstathz == 100) ?
                    ((fixpt_t) ke->ke_cpticks) << ...

Target File: src/sys/kern/sched_4bsd.c
    FOREACH_THREAD_IN_PROC(p, td) {
        awake = 0;
    +   if (ke->ke_cpticks == 0)
    +       [-continue;-]
        if (FSHIFT >= CCPU_SHIFT)
            ke->ke_pctcpu += (realstathz == 100) ?
                ((fixpt_t) ke->ke_cpticks) << ...

IR-1: Inconsistent renamings of identifiers
FreeBSD commit: src/sys/kern/vfs_bio.c, version 1.351, Author: phk, Date: 2003-01-05
Log: Fix cut&paste bug which would result in a panic because buffer was being biodone'ed multiple times.

Reference File: src/sys/kern/vfs_bio.c
    + if ((bp->b_flags & B_CACHE) == 0) {
    +     bp->b_iocmd = BIO_READ;
    +     bp->b_flags &= ~B_INVAL;
    +     if (vp->v_type == VCHR)
    +         VOP_SPECSTRATEGY(vp, bp);
    +     else
    +         VOP_STRATEGY(vp, bp);

Target File: src/sys/kern/vfs_bio.c
    + if ((rabp->b_flags & B_CACHE) == 0) {
    +     rabp->b_flags |= B_ASYNC;
    +     rabp->b_flags &= ~B_INVAL;
    +     if (vp->v_type == VCHR)
    +         VOP_SPECSTRATEGY(vp, [-bp-]{+rabp+});
    +     else
    +         VOP_STRATEGY(vp, [-bp-]{+rabp+});

IR-2: Inconsistent renamings of related identifiers
Linux commit: 5edd0b946a0afebldO364a3654328bO46fb8iSa2, Author: Emmanuel Grumbach, Date: 2013-11-20
Log: Fix a copy paste error in iwl_calc_basic_rates which leads to a wrong calculation of CCK basic rates.

Reference File: /wireless/iwlwifi/dvm/rxon.c
    + if (IWL_RATE_24M_INDEX < lowest_present_ofdm)
    +     ofdm |= IWL_RATE_54M_MASK >> IWL_FIRST_OFDM_RATE;

Target File: /wireless/iwlwifi/dvm/rxon.c
    + if (IWL_RATE_11M_INDEX < lowest_present_[-ofdm-]{+cck+})
    +     [-ofdm-]{+cck+} |= IWL_RATE_11M_MASK >> IWL_FIRST_CCK_RATE;

IDF: Inconsistent Data Flow
FreeBSD commit: src/sbin/gpt/gpt.c, version 1.16, Author: marcel, Date: 2006-07-07
Log: Fix cut-n-paste bug: compare argument s against known aliases, not the global optarg.

Reference File: src/sbin/gpt/gpt.c
    main(int argc, char *argv[]) {
        while ((ch = getopt(argc, argv, ...
            switch (ch) {
    +       case 'o':
    +           if (strcmp(optarg, "space") == 0) {
    +               opt = FS_OPTSPACE;

Target File: src/sbin/gpt/gpt.c
    parse_uuid(const char *s, uuid_t *uuid) {
        switch (*s) {
    +   case 'e':
    +       if (strcmp([-optarg-]{+s+}, "efi") == 0) {
    +           uuid_t efi = GPT_ENT_TYPE_EFI;
                ...

RDN: Redundant operations
Linux commit: I9c2fdbabl n854f2bfcc75c326d0f4537et;2a7e, Author: John W. Linville, Date: 2011-04-29
Log: Looks like a copy-n-paste error, identical lines are a few lines below the ones removed, ...

Reference File: src/sys/dev/mxge/if_mxge.c
      memset(&tsf_tlv, 0x00, sizeof(struct
          mwifiex_ie_types_tsf_timestamp));
    + memcpy(*buffer, &tsf_tlv, sizeof(tsf_tlv.header));
    + *buffer += sizeof(tsf_tlv.header);

Target File: src/sys/dev/mxge/if_mxge.c
      [-memcpy(*buffer, &tsf_val, sizeof(tsf_val));-]
      [-*buffer += sizeof(tsf_val);-]
      memcpy(&tsf_val, bss_desc->time_stamp, sizeof(tsf_val));
    + memcpy(*buffer, &tsf_val, sizeof(tsf_val));
    + *buffer += sizeof(tsf_val);

Ported lines start with "+". In the original, errors are marked in red and fixes are highlighted in green; here, text removed by the fix is shown as [-...-] and text added by the fix as {+...+}.

(IDF): 0.9% in FreeBSD and 1.6% in Linux. Figure 1 shows
the distribution of the five porting error types in FreeBSD and
Linux.

D. Threats to Validity 

Construct Validity. We rely on the method of mining for
porting error related keywords in the commit messages. It is
possible that developers may not document porting errors in
commit messages when fixing porting errors.

Internal Validity. We assume that porting mistakes happen due
to poor adaptation, which may not always be true. The five
types of common porting errors are derived from the analyzed
data and thus are subject to the experimenter's interpretation
or categorization bias.

External Validity. We study porting errors in FreeBSD and
Linux. Both of these projects are written in C. Thus our 
categorization of porting errors may be biased towards C 
language features. Also, we study porting bugs within a 
project boundary. Our observations may differ for cross-system 
porting errors. Though our results may not generalize to other 
systems, we believe our study of two long-surviving, large 
scale operating systems provides meaningful insights. 

III. SPA Approach

This section presents a semantic porting analysis algorithm,
SPA. It detects and categorizes inconsistencies in sequential
program flow and incorrect identifier renaming within the
scope of a single method. Our key intuition is that semantic 
inconsistencies in porting arise due to the interactions between 
ported code and the impacted context, when the contexts differ 
between the reference and the target implementations. 

A. Overview 

An overview of the SPA process is shown in Figure 2.
To detect potential semantic inconsistencies, SPA takes as
input a reference patch that specifies the syntactic differences
between Ref_old and Ref_new and a target patch that specifies
the syntactic differences between Tar_old and Tar_new. We
first extract the set of edit operations, such as insertion and
deletion of program statements, from the target (E_tar) and
reference (E_ref) patches. In step 2 of Figure 2, we estimate
which of the edit operations correspond to the set of program
statements that are ported from Ref_new to Tar_new. The AST
nodes corresponding to the ported statements are stored in
the ported node pairs (PNP) set. We then compute the
statements impacted by the ported statements in the reference
(I_ref) and the target (I_tar) in step 3. We use standard control
and data dependence analyses to compute the impact of the
ported statements on the other statements (the context). In
step 4, the information computed in the previous steps is used
to detect and categorize the potential porting inconsistencies
according to the types presented in Section II (footnote 1). Finally, the
inconsistencies are reported in step 5.

Footnote 1: Type OTH (unadapted indentation or comments) is not included in the
scope of our diagnosis, as this requires textual or lexical analysis and does not
involve the semantics of code fragments.


We illustrate the SPA approach with an example shown in
Table IV. The example is an adapted version of code fragments
from FreeBSD. The code is ported from a reference method,
freebsd4_getfsstat, to a target method, osf1_getfsstat.
Lines marked with "+" are the ported code. The
reference and target contexts are syntactically different. In
osf1_getfsstat, the ported lines T9 and T10 appear after
two if statements at lines T4 and T6. No such if statements
are present in freebsd4_getfsstat. Also, the variable
buf is initialized at line T12. Thus, T13 is in a different
data initialization context in the target than its corresponding
line R6 in the reference.

The program statements that are changed between the old
and new versions are highlighted in gray and the ported edits
are marked with "+" in Table IV. Ported edits T9, T10 and
T13 in Tar_new correspond to R4, R5 and R6 in Ref_new
respectively. The ported edits in Tar_new are control-dependent
on T4 and data-dependent on T1, T2 and T12. Also T11,
T14, and T15 are data-dependent on the ported edits T10
and T13. All of these statements are treated as impacted
statements. Similarly, R1, R2, and R8 are marked as impacted
statements in Ref_new. Next, we present the details of how
impacted ported nodes are generated.

B. Identify the Impact of the Ported Code 

We present the three main steps to identify the porting context
that may impact or may be impacted by the ported code.
The inputs to SPA are two patches specifying the syntactic
differences between Ref_old and Ref_new and between Tar_old
and Tar_new: p_tar := Δ(Tar_old, Tar_new) and p_ref := Δ(Ref_old,
Ref_new).

Step 1. Identify Edits in the Reference and Target: SPA
computes the syntactic edit operations (insert, delete, move,
or update) required on the abstract syntax trees (ASTs) to
transform Ref_old to Ref_new and Tar_old to Tar_new [5]. This
algorithm is inspired by Meng et al.'s edit script generation
and extends its implementation [16], [17]. For the code shown
in Table IV, three edit (insert) operations are identified in the
reference patch, and five edit operations are identified in the
target patch. SPA uses the edit operations to generate the edited
nodes E_ref and E_tar, corresponding to Ref_new and Tar_new
respectively. An edited node e_p is an AST node corresponding
to an edited statement in a program patch p. The source lines
corresponding to the edited nodes are highlighted using a gray
background in Table IV.

Step 2. Identify Ported Nodes: SPA determines the correspondence
of statements in the ported code between the
reference and the target. It is possible that when a developer
adapts ported code from one context to another, she may also
insert or delete additional code; hence, there may be edited
nodes that do not correspond to ported code. A ported node
pair is a pair of AST nodes (r, t), where r ∈ E_ref and
t ∈ E_tar, and r and t have a unique correspondence with
each other. This unique correspondence is determined by a
function clone that takes two arbitrary AST nodes as input
and outputs true if the AST node types are identical and their
Fig. 2. SPA Workflow
TABLE IV
Example: adapted and simplified porting example taken from FreeBSD

Ref_new
R1      int freebsd4_getfsstat(int flags, int bufsize, ostatfs osb) {
R2          statfs buf = null;
R3          int error = 0;
R4   +      int count = bufsize / ostatfs.sizeof();
R5   +      int size = count * statfs.sizeof();
R6   +      error = copyout(osb, buf, size);
R7
R8          return error;
R9      }

Tar_new
T1      int osf1_getfsstat(int flags, int bufsize, osf1statfs osb) {
T2          statfs buf = null;
T3          int error = 0;
T4          if (flags == GETFSSTAT)
T5              return 0;
T6          if (flags == WAIT)
T7              flags = MNT_WAIT;
T8
T9   +      int count = bufsize / [-ostatfs-]{+osf1statfs+}.sizeof();
T10  +      int size = count * statfs.sizeof();
T11         if (size > 0)
T12             buf = new statfs();
T13  +      error = copyout(osb, buf, size);
T14         error = copyout(osb, buf, size);
T15         return error;
T16     }

Edited lines in the new version w.r.t. the old version are presented with a dark background. The ported lines begin with +. The red lines are inconsistent statements
detected by SPA. In T9 the original shows two type names: [-ostatfs-], carried over from the reference, and {+osf1statfs+}, the type used in the target signature.


labels are also similar above a certain threshold based on bi-gram
similarity [20]. Bi-gram similarity is the ratio of
the total number of bi-grams common between two strings
to the average number of bi-grams representing the strings.
The output ranges from 0 to 1. A high value indicates that the
strings are either identical or very similar, i.e., when developers
rename identifiers after porting. We set the similarity threshold
to a high value of 0.8 to ensure that the matched labels are
very similar to each other, indicating truly ported nodes. Our
definition of ported node pair is very restrictive to reduce
false positives in the later steps; we only consider one-to-one
correspondences between a reference and a target node, and
ignore node pairs with one-to-many correspondences.
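
The bi-gram similarity described above can be sketched as follows. This is one plausible reading of the definition, not SPA's implementation: character-level bi-grams, multiset counting of common bi-grams, and the handling of strings shorter than two characters are all assumptions.

```python
from collections import Counter


def bigrams(text):
    """Multiset of adjacent character pairs in the string."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_similarity(a, b):
    """Common bi-grams divided by the average bi-gram count of the two strings."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Assumption: strings too short to have bi-grams match only if equal.
        return 1.0 if a == b else 0.0
    common = sum((grams_a & grams_b).values())  # multiset intersection
    return common / (total / 2.0)


# A renamed identifier keeps most of its bi-grams, so the score stays high.
renamed = bigram_similarity("bp->b_flags", "rabp->b_flags")
unrelated = bigram_similarity("abcd", "wxyz")
```

With these inputs, the renamed pair shares 10 bi-grams out of an average of 11, which is above the 0.8 threshold, while the unrelated pair shares none.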

$$\mathit{PNP} = \{\, (r, t) \mid r \in E_{ref} \land t \in E_{tar} \land \mathit{clone}(r, t) \,\} \qquad (1)$$
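
Equation 1, together with the one-to-one restriction, can be read as a filter over pairs of edited nodes. The sketch below is illustrative only: the Node class, the ">=" comparison against the threshold, and the sample labels (loosely modeled on the IR-1 example in Table II) are assumptions, and real AST labels may be normalized differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    ast_type: str
    label: str


def bigram_similarity(a, b):
    """Common character bi-grams over the average bi-gram count (assumed reading)."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2.0 * sum((grams_a & grams_b).values()) / total


def clone(r, t, threshold=0.8):
    """True if the AST types are identical and the labels are similar enough."""
    return r.ast_type == t.ast_type and bigram_similarity(r.label, t.label) >= threshold


def ported_node_pairs(e_ref, e_tar):
    """Pairs satisfying clone(r, t), keeping only one-to-one correspondences."""
    matches = [(r, t) for r in e_ref for t in e_tar if clone(r, t)]
    ref_hits = Counter(r.node_id for r, _ in matches)
    tar_hits = Counter(t.node_id for _, t in matches)
    return [(r.node_id, t.node_id) for r, t in matches
            if ref_hits[r.node_id] == 1 and tar_hits[t.node_id] == 1]


# Hypothetical edited nodes.
e_ref = [
    Node("r1", "if", "(bp->b_flags & B_CACHE) == 0"),
    Node("r2", "expr", "bp->b_iocmd = BIO_READ"),
    Node("r3", "expr", "bp->b_flags &= ~B_INVAL"),
]
e_tar = [
    Node("t1", "if", "(rabp->b_flags & B_CACHE) == 0"),
    Node("t2", "expr", "rabp->b_flags |= B_ASYNC"),
    Node("t3", "expr", "rabp->b_flags &= ~B_INVAL"),
]
pnp = ported_node_pairs(e_ref, e_tar)
```

In this example r2 and t2 are edited nodes without a counterpart, so they stay outside PNP, which mirrors the case of edits that were added during adaptation rather than ported.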

PNP is a set of ported node pairs where each pair (r, t) ∈
PNP represents a node ported from a reference patch to a target
patch as defined in Equation 1. Each node in the pair (r, t)
is referred to as a ported node. For example, the nodes corresponding
to statements R5 and T10 in Table IV have the same
AST node type (declaration) and label (size = count
* statfs.sizeof()), hence clone(R5, T10) is true and
(R5, T10) is a ported node pair. However, no AST nodes in
E_ref are syntactically similar to the AST node corresponding
to statement T11 in E_tar. Therefore, T11 is not a member
of any ported node pair. All of the statements identified with
"+" in Table IV have corresponding AST nodes in PNP.


Step 3. Identify Impacted Nodes: Next, SPA identifies the
AST nodes in Ref_new and Tar_new that are either impacted by
or impact the semantics of the ported nodes. The impacted
nodes include all of the ported nodes, and the subset of
the context nodes that may affect the porting semantics or
may be affected by the ported nodes. SPA identifies the
impacted nodes using static intra-procedural data- and control-
dependence analyses [22] with respect to the ported nodes.
This step bears resemblance to how Sydit identifies the context
of edit operations using control and data analysis [16].

Data Dependence. Statement S2 is data dependent on S1 if
S1 defines a variable v and S2 uses v, such that there exists
a path from S1 to S2 along which v is not killed (redefined).
Control Dependence. Statement S2 is control dependent on
S1 if execution of S2 depends on the decision made at S1.

Definition 3.1: A program dependence graph, PDG =
(DN, DE), is a set of vertices DN representing program
statements, and a set of edges, DE ⊆ DN × DN, representing
the control and data dependencies between statements.

A control dependence graph (CDG) is a sub-graph of a
PDG, where the edges represent control dependencies between 
vertices (program locations), whereas a data dependence graph 
(DDG) is a sub-graph of the PDG where the edges represent 
data dependencies between vertices. 

In SPA, we construct the PDG vertices using AST nodes,
each of which represents an atomic program statement, and the
edges correspond to the control and data dependences between
statements. The impacted nodes in Ref_new and Tar_new are
derived from their respective program dependence graphs,
PDG_ref and PDG_tar. Given a set of vertices mapping to
ported nodes V_p ⊆ Ref_new and the PDG for Ref_new, we
generate the impacted nodes I_ref. The impacted nodes map
to vertices in the PDG reachable from V_p along the control
and data dependence edges. Similarly, we find I_tar from V_p ⊆
Tar_new. The vertices corresponding to statements T6 and T7
in Table IV are not control or data dependent on ported code,
hence they are not in the impact set.
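
The impact computation can be sketched as graph reachability over dependence edges. The code below is a simplified illustration, not SPA's implementation: it takes the union of a backward slice (what the ported nodes depend on) and a forward slice (what depends on them), which is one reading of "reachable along the control and data dependence edges". The edge list is an assumption reconstructed from the prose description of the Table IV target method.

```python
from collections import defaultdict, deque


def reachable(start_nodes, adjacency):
    """Breadth-first search returning every node reachable from the start set."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def impacted_nodes(ported, dependence_edges):
    """Union of the forward and backward slices of the ported nodes.

    Each edge (src, dst) means: dst is control or data dependent on src.
    """
    forward, backward = defaultdict(set), defaultdict(set)
    for src, dst in dependence_edges:
        forward[src].add(dst)
        backward[dst].add(src)
    return reachable(ported, forward) | reachable(ported, backward)


# Assumed dependence edges for the target method of Table IV.
TAR_EDGES = [
    ("T4", "T6"), ("T6", "T7"),                  # control, context only
    ("T4", "T9"), ("T4", "T10"), ("T4", "T13"),  # control
    ("T1", "T9"), ("T9", "T10"), ("T1", "T13"),  # data
    ("T2", "T13"), ("T12", "T13"), ("T10", "T13"),
    ("T10", "T11"), ("T11", "T12"),
    ("T10", "T14"), ("T13", "T15"),
]
i_tar = impacted_nodes({"T9", "T10", "T13"}, TAR_EDGES)
```

Under these assumed edges, T6 and T7 stay outside the impact set even though they depend on T4, because T4 is reached only in the backward direction and its other dependents are not followed.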

C. Detect and Categorize Porting Inconsistencies

SPA categorizes porting inconsistencies according to the
types presented in Section II, using ported node pairs, PNP,
impacted nodes, I_ref and I_tar, and the data- and control-
dependence information computed in the previous steps.

ICF: Inconsistent Control Flow, To detect TCF inconsis- 
tencies, spa performs the following steps: 

• Given a pair of ported nodes, (r, t), we construct isomorphic
sub-graphs starting from r in CDG_ref and from t in
CDG_tar. A pair of vertices (v_r, v_t), where v_r ∈ CDG_ref
and v_t ∈ CDG_tar, is isomorphic if (i) the vertex labels
have identical AST types and similar syntactic structures
(e.g., nodes 'a = a + b' and 'x = y + z' have the same AST
type and syntactic structure), and (ii) the vertices have the
same relative position with respect to the ported nodes.
We extend Komondoor et al.'s program slicing based
clone detection algorithm [14] to construct the isomorphic
sub-graphs.

• Detect inconsistent nodes in the context with respect to
(r, t) and add them to the respective inconsistent sets,
IC_ref and IC_tar. A node in I_ref (I_tar) is inconsistent if
it is reachable from r (t) in CDG_ref (CDG_tar), but it
is not contained in the respective isomorphic sub-graph.

The nodes corresponding to statements R4 and T9 in Table IV
are a ported node pair. R4 is not control dependent
on any node within the method body, while T9 is control
dependent on T4 along the true control edge. T4 is then added
to IC_tar, as it is reachable from ported node T9 although it
does not have a corresponding node in the reference.
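
The two ICF steps can be sketched in Python as follows. This is a simplified illustration rather than the extended Komondoor et al. algorithm: the helper names, the dictionary encoding of a CDG (each vertex maps to the vertices it is control dependent on), and the `kind` labels standing in for AST types are all assumptions made for the example.

```python
def match_subgraphs(cdg_ref, cdg_tar, kind, r, t):
    """Grow matched (reference, target) vertex pairs in lockstep from (r, t).

    Two vertices are paired only when their (hypothetical) AST kinds agree.
    """
    matched = {(r, t)}
    work = [(r, t)]
    while work:
        a, b = work.pop()
        used = set()
        for x in cdg_ref.get(a, ()):
            for y in cdg_tar.get(b, ()):
                if y not in used and kind[x] == kind[y] and (x, y) not in matched:
                    used.add(y)
                    matched.add((x, y))
                    work.append((x, y))
                    break
    return matched


def reachable(cdg, start):
    """Vertices reachable from start along control dependence edges."""
    seen, work = set(), [start]
    while work:
        vertex = work.pop()
        for other in cdg.get(vertex, ()):
            if other not in seen:
                seen.add(other)
                work.append(other)
    return seen


def icf_inconsistencies(cdg_ref, cdg_tar, kind, r, t):
    """Reachable vertices that are missing from the isomorphic sub-graphs."""
    pairs = match_subgraphs(cdg_ref, cdg_tar, kind, r, t)
    iso_ref = {a for a, _ in pairs}
    iso_tar = {b for _, b in pairs}
    return reachable(cdg_ref, r) - iso_ref, reachable(cdg_tar, t) - iso_tar


# Shape of the R4/T9 example: only T9 has a controlling node (T4).
cdg_ref = {"R4": []}
cdg_tar = {"T9": ["T4"]}
kind = {"R4": "call", "T9": "call", "T4": "if"}  # kinds are made up
ic_ref, ic_tar = icf_inconsistencies(cdg_ref, cdg_tar, kind, "R4", "T9")
```

With these inputs the reference side yields no inconsistent node, while T4 is reported on the target side, matching the outcome described for the example.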

IR: Inconsistent Renaming. To detect this inconsistency,
we first construct the isomorphic sub-graphs on CDG_ref and
CDG_tar with respect to the ported node pairs, as described
earlier. For each isomorphic node pair in CDG_ref and CDG_tar,
we extract the corresponding identifiers, i.e., variables, types,
and method names, and align them based on their syntactic
similarity. For example, given two isomorphic nodes with
labels 'a = b + c' and 'x = y + z', variable a is aligned with
x, variable b is aligned with y, and variable c is aligned with
z. We rank each identifier mapping with a confidence value
based on the number of times the mapping is encountered.
Using these alignments, we generate two identifier maps:
(a) IdMap_ref, a map from each reference identifier to its
corresponding target identifiers, and (b) IdMap_tar, a map
from each target identifier to its corresponding reference
identifiers. If a one-to-many or a many-to-one relation is
found in the maps, then an IR inconsistency is detected. We
consider identifier mappings with the lowest (or all, when
there is a tie) confidence values as the incorrect mappings,
and characterize the vertices in the isomorphic sub-graphs
corresponding to the incorrect mappings as inconsistent.

Table V shows an example of IdMap_ref generated
from Table IV. SPA generates a map entry (osflstatfs
→ ostatfs) from the method signatures and (osflstatfs
→ osflstatfs) from the isomorphic nodes R4 and T9.
Since the reference variable osflstatfs maps to two target
variables, osflstatfs and ostatfs, an IR inconsistency
is detected.
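
A minimal Python sketch of the identifier-map check is given below. It is only an illustration under stated assumptions: the function names and the pair-list encoding are hypothetical, and the aligned identifier pairs are transcribed from the Table V example rather than extracted from real ASTs.

```python
from collections import Counter, defaultdict


def identifier_map(aligned_pairs):
    """Build a map source -> {target: confidence} from aligned pairs.

    The confidence of a mapping is the number of times it is encountered.
    """
    id_map = defaultdict(dict)
    for (source, target), count in Counter(aligned_pairs).items():
        id_map[source][target] = count
    return id_map


def suspicious_mappings(id_map):
    """Report sources mapped to several targets, keeping the targets with
    the lowest confidence (all of them when there is a tie)."""
    report = {}
    for source, targets in id_map.items():
        if len(targets) > 1:
            lowest = min(targets.values())
            report[source] = sorted(
                name for name, count in targets.items() if count == lowest
            )
    return report


# Aligned (reference, target) identifiers following the Table V example.
pairs = [
    ("flags", "flags"), ("bufsize", "bufsize"),
    ("osflstatfs", "ostatfs"), ("osb", "osb"),          # from (R1,T1)
    ("count", "count"), ("bufsize", "bufsize"),
    ("osflstatfs", "osflstatfs"),                       # from (R4,T9)
]
id_map_ref = identifier_map(pairs)
id_map_tar = identifier_map((tar, ref) for ref, tar in pairs)
one_to_many = suspicious_mappings(id_map_ref)
many_to_one = suspicious_mappings(id_map_tar)
```

Here `one_to_many` flags osflstatfs with both of its targets because their confidence values tie, while `many_to_one` stays empty since no target identifier maps back to two reference identifiers.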

TABLE V

Identifier Mapping from Table IV

| Isomorphic Nodes | Identifier Map (IdMap_ref) |
|------------------|----------------------------|
| (R1,T1) | flags → flags (1), bufsize → bufsize (1), osflstatfs → ostatfs (1), osb → osb (1) |
| (R4,T9) | count → count (1), bufsize → bufsize (2), osflstatfs → ostatfs (1), osflstatfs (1) * |

* marks the inconsistent mapping.


Sometimes developers forget to update related identifiers,
as shown in the IR-2 example in Table II. To detect this
inconsistency, we carry out a similar process at the granularity
of tokens, as opposed to identifiers, after splitting identifier
names at separators or according to a camel case convention.

For example, OFDM is mapped to CCK once, while ofdm is
mapped to ofdm twice.
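
The token-level variant can be sketched in Python as below. The regular expressions, the function names, and the position-wise alignment rule are assumptions chosen for illustration, not SPA's actual tokenizer; the identifiers are taken from the Linux example discussed in the evaluation.

```python
import re


def tokenize(identifier):
    """Split an identifier at separators and at camel case boundaries."""
    tokens = []
    for part in re.split(r"[_\W]+", identifier):
        # Runs of capitals, capitalized or lowercase words, and digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return tokens


def align_tokens(ref_identifier, tar_identifier):
    """Pair tokens position by position when both names split equally."""
    ref_tokens = tokenize(ref_identifier)
    tar_tokens = tokenize(tar_identifier)
    if len(ref_tokens) != len(tar_tokens):
        return []
    return list(zip(ref_tokens, tar_tokens))


# The method name was updated from send to recv, but send_cq was not.
token_pairs = (
    align_tokens("mlx4_ib_post_send", "mlx4_ib_post_recv")
    + align_tokens("send_cq", "send_cq")
)
targets_of_send = {tar for ref, tar in token_pairs if ref == "send"}
```

Because the token send ends up mapped to both recv and send, a one-to-many relation appears at the token level even though each whole identifier maps consistently.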

IDF: Inconsistent Data Flow. IDF inconsistency detection
is similar to our ICF diagnosis but uses data dependence
graphs (DDG) instead of CDGs.

In Table IV, R8 and T13 are statements corresponding
to a ported node pair. In the reference implementation, R8
is data dependent on R2 for the definition of variable buf.
However, statement T13 in the target implementation is data
dependent on the definition of buf at T2 and T12. Although
R2 and T2 are isomorphic, the dependence on T12 creates an
additional data dependence in the target implementation that
is not present in the reference implementation. Therefore, the
node corresponding to T12 is added to IC_tar.

Similarly, R5 and T10 are statements corresponding to a
ported node pair, and both define variable size. However, in
the reference implementation, size is used at statement R8,
while in the target implementation, size is used at statements
T11, T13, and T14. Although R8 and T13 are isomorphic,
T11 and T14 create additional data dependences in the
target implementation that are not present in the reference
implementation. Therefore, the nodes corresponding to T11
and T14 are added to IC_tar.

RDN: Redundant Operations. To detect redundant ported
code, SPA checks for pairs of vertices in CDG_tar that have
identical labels and types and that are control dependent on
the same impacted vertex. Note that we only look for an RDN
inconsistency in Tar_new. In Table VI, statements T13 and T14
in the target implementation have identical syntax, and both
are control dependent on the impacted statement T4. Thus,
SPA characterizes the nodes corresponding to statements T13
and T14 as redundant.
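
The RDN check can be sketched in a few lines of Python. The function name and the dictionary encodings are hypothetical; the sketch assumes that each vertex has a single controlling vertex and that a label string captures both the syntax and the AST type of a statement. The statement texts in the example are placeholders, not the code of Table IV.

```python
from collections import defaultdict


def redundant_nodes(labels, control_parent, impacted):
    """Group target vertices with identical labels that are control
    dependent on the same impacted vertex; groups of two or more are
    reported as redundant."""
    groups = defaultdict(list)
    for node, label in labels.items():
        parent = control_parent.get(node)
        if parent in impacted:
            groups[(label, parent)].append(node)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Shape of the T13/T14 example; the label strings are made up.
labels = {"T11": "use(size)", "T13": "copy(buf, size)", "T14": "copy(buf, size)"}
control_parent = {"T11": "T4", "T13": "T4", "T14": "T4"}
redundant = redundant_nodes(labels, control_parent, impacted={"T4"})
```

Only T13 and T14 share both a label and a controlling impacted vertex, so they form the single redundant group.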


Table VI shows the nodes that are inconsistent with respect
to the ported code in Table IV, along with their corresponding
inconsistency types.

TABLE VI

Characterization of Porting Inconsistencies in Table IV

| Inconsistency type | Nodes |
|--------------------|-------|
| Inconsistent Control Dependent Nodes (ICF) | T4 |
| Inconsistent Identifier Renaming (IR) | T9 (identifier: ostatfs) |
| Inconsistent Data Dependent Nodes (IDF) | T11, T12, T14 |
| Redundant Nodes (RDN) | T13, T14 |


D. Implementation

SPA is implemented using several existing tool chains. First,
we extend LASE [17] and Sydit [16], which extract edit scripts
to automate systematic program changes. SPA also extends the
control and data dependence analysis of Sydit to identify the
impact of ported nodes in the reference and target programs
respectively. The dependency analysis uses Crystal [2], a static
analysis framework for Java source code.

IV. Experimental Results

In this section, we present an empirical evaluation of
SPA's ability to detect and diagnose porting inconsistencies
in FreeBSD, Linux, Eclipse CDT, and Mozilla. We compare
the accuracy of the results computed by SPA with the results
computed by two state-of-the-art tools, Jiang et al.'s clone
related error detection tool [9] and DejaVu [6]. Jiang et al.
model the context of ported code in terms of its immediately
preceding lines, even if the context does not have any control
or data dependence on ported code. Though DejaVu extends
Jiang et al. by refining clone detection results to determine
ported code, it still suffers from the same limitation as Jiang
et al., as the context is identified based on physical location
proximity, not on control and data flow dependences with the
ported code.

We also measure SPA's accuracy in characterizing potential
inconsistencies based on the categories defined in Section II.
To this end, we investigate two research questions:

• RQ1. Can SPA accurately detect porting inconsistencies?

• RQ2. Can SPA accurately categorize different types of
porting inconsistencies?

A. Study Subjects 

To evaluate SPA, we use porting examples from four different
projects: FreeBSD, Linux, Eclipse CDT, and Mozilla.
Except for Mozilla, the reference and target patches for each
artifact are computed using REPERTOIRE [18]. From these, we
randomly select (a) 20 examples from FreeBSD, (b) 10 examples
from Linux, (c) 60 examples from Eclipse CDT that are
ported from CDT versions CDT_2_0 to CDT_8_1_1, and (d)
42 Mozilla examples from the annotated data set of copy-paste
errors provided by Gabel et al. [6]. The FreeBSD and Linux
artifacts are from the data sets used in Section II. To retrieve
a large number of porting instances, we choose the CDT_2_0
and CDT_8_1_1 versions, which are 98 months apart. The
Mozilla examples were obtained from DejaVu's annotated data
set (see footnote 2), because DejaVu is not an open-source tool.
In the Mozilla examples, we treat an entire program as a program
patch whose old version is empty, because SPA works on program
patches as opposed to entire programs. We use a combination
of commit logs and manual inspection to annotate the types of
potential porting errors in selected target patches of the subject
artifacts.

The current version of SPA analyzes only Java source code,
so we convert the C and C++ porting examples from Linux,
FreeBSD, and Mozilla using a free C/C++ to Java
code converter [1].


B. Study Methodology 

We measure SPA's capability to detect and categorize porting
errors in terms of precision and recall. For each error
type e defined in Section II, suppose that S is the set of
examples where a porting inconsistency is detected by SPA and
its error type is reported by SPA to be e. Suppose that A is the
set of examples where a porting inconsistency is manually
determined to be of type e. Then the precision and recall
of SPA in categorizing porting inconsistencies are defined as
follows:

Precision, the percentage of porting inconsistencies of type
e found by SPA that are also known to be type e, i.e.,
$\frac{|A \cap S|}{|S|}$.

Recall, the percentage of the known inconsistencies of type
e, which are also found to be type e by SPA, i.e.,
$\frac{|A \cap S|}{|A|}$.

To evaluate the accuracy of SPA's error detection capability, 
we calculate precision and recall without considering individ- 
ual error types. 
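
As a worked illustration of these definitions, the following Python sketch computes both measures from two sets of example identifiers. The function name and the identifiers are hypothetical; only the counts (43 reported examples, 15 false positives, 3 false negatives) are taken from the Eclipse CDT column for SPA in Table VII.

```python
def precision_recall(reported, annotated):
    """Precision |A ∩ S| / |S| and recall |A ∩ S| / |A| for the reported
    set S and the manually annotated set A."""
    hits = len(reported & annotated)
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(annotated) if annotated else 0.0
    return precision, recall


# Made-up example identifiers reproducing the counts of Table VII.
true_positives = {f"tp{i}" for i in range(28)}
false_positives = {f"fp{i}" for i in range(15)}
false_negatives = {f"fn{i}" for i in range(3)}

reported = true_positives | false_positives      # S, 43 examples
annotated = true_positives | false_negatives     # A, 31 examples
precision, recall = precision_recall(reported, annotated)
```

The resulting values, 28/43 for precision and 28/31 for recall, correspond to roughly 65% and 90%, in line with the figures reported for SPA on Eclipse CDT.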


C. Study Results and Discussions

RQ1. Can SPA accurately detect porting inconsistencies?

We compare SPA's ability to detect porting inconsistencies
with Jiang et al.'s clone related bug detection algorithm [9]
(see footnote 3) and DejaVu [6]. Table VII summarizes the
comparison of SPA with Jiang et al. using the Eclipse CDT
artifact and with DejaVu on the Mozilla examples. The first row
represents the number of potential porting errors, regardless of
error type, that were detected by the respective tools. We also
report the number of false positives, false negatives, precision,
and recall of the error detection capability of each tool. The
results of our study show that SPA improves the error detection
capabilities considerably over Jiang et al. SPA improves the
precision from 48% to 65%, and marginally improves the recall
from 87% to 90%.

Out of the 42 randomly selected examples from the DejaVu
annotated Mozilla data set, our manual inspection shows that
only 25 of them contain true porting inconsistencies. Thus,
DejaVu's precision is 59.52%. For the same data set, SPA
reports inconsistencies for 34 examples. Thus, SPA's precision
in detecting errors on the Mozilla data set is 73.53%, as
shown in Table VII. Because this data set does not contain
any examples where DejaVu fails to report an inconsistency,
we are unable to assess the number of false negatives for
either DejaVu or SPA. Furthermore, because our comparison
is limited to the data set where DejaVu already found porting
inconsistencies, the precision of SPA could be lower if the
comparison were done on a different data set.

2 http://wwwcsif.cs.ucdavis.edu/~gabel/research/dejavu_mozilla.zip
3 Jiang et al.'s clone detector Deckard and the associated clone bug detector
were downloaded from https://github.com/skyhover/Deckard.

We find that SPA reduces false positives over Jiang et al.’s 
tool and DejaVu in 14 and 8 cases respectively. For example, 
consider a case when a variable is initialized differently in the 
reference and target contexts. Later, both the reference and 
the target contexts reinitialize the variable in the same manner 
before using it in the ported code. In this case, SPA correctly 
does not report any inconsistency unlike other tools, because 
there is no data flow between the inconsistent initialization 
and the ported code. 

The cases where all three tools incorrectly detect inconsis- 
tencies include porting code from a while context to a for 
context, porting code from an if context to a switch-case 
context, etc. 


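TABLE VII

Inconsistency detection results for Eclipse CDT and Mozilla

|                | Eclipse CDT: SPA | Eclipse CDT: Jiang's Tool | Mozilla: SPA | Mozilla: DejaVu |
|----------------|------------------|---------------------------|--------------|-----------------|
| Detected       | 43               | 56                        | 34           | 42              |
| False Positive | 15               | 29                        | 9            | 17              |
| False Negative | 3                | 4                         | -            | -               |
| Precision      | 65.11%           | 48.21%                    | 73.53%*      | 59.52%*         |
| Recall         | 90.32%           | 87.09%                    | -            | -               |

*The comparison is done on the data set where DejaVu already reported
porting errors.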


RQ2. Can SPA accurately categorize different types of
porting inconsistencies?

Table VIII shows the precision and recall for SPA in
categorizing potential porting errors in FreeBSD and Linux
for the error types ICF, IR-1, IR-2, IDF, and RDN. SPA
has precision ranging from 50% for ICF to 100% for RDN.
The recall for SPA ranges from 62.5% for RDN to 100% for
ICF and IDF w.r.t. the porting errors reported in the version
histories (see 2nd row in Table VIII). Version history based
evaluation is often conservative in the sense that when there is
no mention of porting errors in the commit messages, it does
not necessarily imply the absence of porting inconsistencies.
To overcome this limitation, we compare SPA results against
the type and location of inconsistencies that were identified
by manual inspection of individual patches. The comparison
against this annotated set is shown in Rows 5-7 in Table VIII.

TABLE VIII

Inconsistency characterization results on FreeBSD and Linux

|                    | ICF  | IR-1  | IR-2   | IDF    | RDN   |
|--------------------|------|-------|--------|--------|-------|
| SPA Detected       | 10   | 8     | 6      | 9      | 5     |
| From commit logs   | 5    | 8     | 5      | 6      | 8     |
| Precision          | 50%  | 87.5% | 66.66% | 66.66% | 100%  |
| Recall             | 100% | 87.5% | 80%    | 100%   | 62.5% |
| Manually annotated | 7    | 8     | 5      | 8      | 8     |
| Precision          | 70%  | 87.5% | 66.66% | 87.5%  | 100%  |
| Recall             | 100% | 87.5% | 80%    | 100%   | 62.5% |

Table IX summarizes the number of porting inconsistencies
for each error type, and the precision and recall based on the
manually identified error types for the Eclipse CDT and Mozilla
data sets. In Eclipse CDT, SPA detects and characterizes 62
porting inconsistencies: 77% are ICF, 16% are IR-1, 12% are
IR-2, and 40% are IDF. In Mozilla, SPA detects 54 instances
of porting inconsistencies, of which 28%, 22%, 7%, and 43%
are of type ICF, IR-1, IR-2, and IDF respectively. No RDN
inconsistency is reported in these two data sets. On average,
SPA achieves 58% precision and 92% recall in Eclipse CDT,
and 63% precision and 100% recall in the Mozilla data set.

In detecting ICF inconsistencies, SPA may report false
positives when, for example, code is ported from a for
block to an equivalent while block, because these two loops
have different syntaxes. SPA may generate a false positive of
type IR-1 when the relative ordering of program variables is
changed, but the semantics remain unchanged, e.g., a statement
x = x+y in the reference implementation is modified to x = y+x
in the target. When characterizing IR-2 inconsistencies, SPA
may report false positives when, for instance, the names cannot
be tokenized properly due to inconsistent naming conventions.
For example, if a ported node pair contains the variables
fooBar and foobar, SPA correctly splits the first one into
foo and Bar but does not split foobar. Thus, SPA misaligns
the tokens. In the case of IDF inconsistencies, SPA may report
a false positive when, for example, a variable is declared
and defined in a single program statement in the reference,
but the declaration and definition are separate statements in
the target. Here, SPA reports an inconsistency because the
AST node types are different (declaration versus assignment).
With respect to false negatives, SPA is not able to detect
redundancies that require a deeper semantic analysis, such as
redundant locking calls in a concurrency construct.

In spite of these limitations, there are some success
stories. A bug was fixed in the FreeBSD source file
src/sys/dev/mxge/if_mxge.c, version 1.27, with the
commit message: 'Fix an mbuf leak caused by a cut&paste
bug where the small ring's mbufs were never freed, but the
big ring was freed twice'. A buffer rx_big was mistakenly
freed twice. SPA detects this bug successfully and categorizes
it as an RDN bug. The bug was confirmed by the developers
and took 26 releases and 432 days to detect and fix. Jiang et
al.'s tool is not able to detect this bug since it does not handle
redundancy.

Another identifier renaming bug was fixed in Linux
at commit id 2b9460. Code was ported from method
mlx4_ib_post_send to mlx4_ib_post_recv, but
variable send_cq was never updated to recv_cq. This bug
caused a queue overflow in the infiniband driver module
(a high-speed network driver) and took 974 days to fix. SPA
successfully detected this error. Other tools were unable to
detect this error because they do not check whether related
variables were updated consistently (IR-2).

TABLE IX

SPA INCONSISTENCY DIAGNOSIS RESULTS

Eclipse CDT

|                | ICF      | IR-1    | IR-2   | IDF      | Total  |
|----------------|----------|---------|--------|----------|--------|
| SPA Detected   | 33 (53%) | 7 (11%) | 5 (8%) | 17 (27%) | 62     |
| Annotated      | 23       | 7       | 4      | 5        | 39     |
| False Positive | 12       | 2       | 2      | 12       | 26     |
| False Negative | 2        | 2       | 1      | 0        | 3      |
| Precision      | 63.63%   | 71.43%  | 60%    | 29.41%   | 58.06% |
| Recall         | 91.30%   | 71.43%  | 75%    | 100%     | 92.31% |

Mozilla

|                | ICF      | IR-1     | IR-2   | IDF      | Total  |
|----------------|----------|----------|--------|----------|--------|
| SPA Detected   | 15 (28%) | 12 (22%) | 4 (7%) | 23 (43%) | 54     |
| Annotated      | 13       | 6        | 2      | 13       | 34     |
| False Positive | 2        | 6        | 2      | 10       | 20     |
| False Negative | 0        | 0        | 0      | 0        | 0      |
| Precision      | 86.66%   | 50.0%    | 50.0%  | 56.52%   | 62.96% |
| Recall         | 100%     | 100%     | 100%   | 100%     | 100%   |

We do not detect any RDN inconsistency here.

V. RELATED WORK 

Juergens et al. [10] conduct an empirical study on the
impact of inconsistent clones in a code base. They detect
inconsistent clones using a suffix-tree based, lexical clone
detection algorithm. Their interviews with developers confirm
that inconsistencies in the found clones are indeed bugs and
report that 'nearly every second, unintentional inconsistent
changes to clones lead to a fault.'

Chou et al. show that porting is an important source of
bugs in operating systems [4]. In 65% of the ported code, at
least one identifier is renamed, and in 27% of the cases at least
one statement is inserted, modified, or deleted [15]. An incorrect
adaptation of ported code often leads to porting errors [9]. This
observation is aligned with our findings, where we find 113
and J 82 porting errors by mining FreeBSD and Linux version 
histories respectively. 

Using CP-Miner, a mining based clone detection tool, Li et
al. find 28 and 23 errors in Linux and FreeBSD respectively,
which developers created by forgetting to rename identifiers
consistently after copy and paste [15]. Jablonski et al. [7]
detect similar errors by tracking copy-paste code within an
Eclipse IDE and by comparing the corresponding AST
representations. Though the results of these studies are aligned
with our findings of IR inconsistencies, we observe that such
inconsistent renaming is a special case of a more general
category of porting inconsistencies: forgetting to adapt identifiers
according to the target context (IR-1 and IR-2).

SPA detects a broader scope of inconsistent renamings by
tokenizing function names, file names, and identifier names
using a camel case naming convention and mapping corresponding
tokens. Our algorithm detects an inconsistency when a
token in one context maps to multiple tokens in the other
context. For example, when code is ported from Export.java
to Import.java, SPA checks whether all names related
to export are updated to import.

Jiang et al. show that an inconsistent context can also
cause porting errors [9]. However, their definition of context is
limited to the innermost control flow construct surrounding the
cloned code. They identify syntactic clones using AST level
similarity [8], and then detect inconsistencies by comparing
the contexts. While their diagnosis partially overlaps with our
categorization of porting errors (ICF and IR-1), they do not
report renaming errors on groups of identifiers (IR-2), data
flow inconsistencies (IDF), or redundant operations (RDN).
Also, their error detection analysis is purely syntactic, and
thus suffers from a higher rate of false positives than our
semantic, control- and data-flow based approach. SPA reports
17 percentage points better precision and 3 percentage points
more recall in detecting porting inconsistencies than Jiang et
al. on the Eclipse CDT data set.

DejaVu extends the work by Jiang et al. by using several
filtering heuristics, such as assessing textual similarity and
pruning non-cloned contexts, to improve its precision [6].
As shown in our evaluation, SPA's error detection still
outperforms DejaVu with 14 percentage points better precision.
Also, DejaVu does not report potential error types, while SPA
automatically characterizes the detected inconsistencies to help
developers detect porting errors.

VI. Conclusion 

When porting code from one context to another, the
semantics of the ported code often change due to differences
in the surrounding contexts. Developers may overlook such
subtle differences, inadvertently creating a porting error. By
analyzing the version histories for Linux and FreeBSD, we
identify five common categories of porting errors, and then use
this categorization to design SPA, a novel algorithm to detect
and characterize semantic inconsistencies in ported code. Our
evaluation of SPA on several large open-source code bases
shows that SPA can detect porting inconsistencies with 65% to
75% precision and 90% recall, and that it improves on the
precision of two state-of-the-art techniques by 14 to 17
percentage points.

As part of our future work, we plan to investigate methods 
for further reducing false positives, such as comparing the 
dynamic program behaviors of ported code. Based on the 
observation that not all inconsistencies lead to an error, we 
also plan to investigate heuristics to rank the inconsistencies 
based on their error potential. Finally, we plan to integrate SPA 
with an integrated development environment so that developers 
can detect porting inconsistencies during the porting process. 